Sparse Principal Component Analysis (PCA) methods are efficient
tools to reduce the dimension (or the number of variables) of
complex data. Sparse principal components (PCs) are easier to in- 
terpret than conventional PCs, because most loadings are zero. We 
study the asymptotic properties of these sparse PC directions for 
scenarios with fixed sample size and increasing dimension (i.e. High 
Dimension, Low Sample Size (HDLSS)). Under the previously stud- 
ied spike covariance assumption, we show that Sparse PCA remains 
consistent under the same large spike condition that was previously 
established for conventional PCA. Under a broad range of small spike 
conditions, we find a large set of sparsity assumptions where Sparse 
PCA is consistent, but PCA is strongly inconsistent. The boundaries 
of the consistent region are clarified using an oracle result. 

1. Introduction. Principal Component Analysis (PCA) is an impor- 
tant visualization and dimension reduction tool for High Dimension, Low 
Sample Size (HDLSS) data. However, the linear combinations found by PCA 
typically will involve all the variables, with non-zero loadings, which can be 
challenging to interpret. In order to overcome this weakness of PCA, we 
will study sparse PCA methods that generate sparse principal components 
(PCs), i.e. PCs with only a few non-zero loadings. Several sparse PCA methods
have been proposed to facilitate the interpretation of HDLSS data, see
for example Zou, Hastie and Tibshirani (2006) [? ], Shen and Huang (2008) [? ],
Leng and Wang (2009) [? ], Witten, Tibshirani and Hastie (2009) [? ],
Johnstone and Lu (2009) [? ], Amini and Wainwright (2009) [? ], and Ma
(2010) [? ].

This paper studies the HDLSS asymptotic properties of sparse PCA. 
HDLSS asymptotics are based on the limit, as the dimension $d \to \infty$,
with the sample size $n$ fixed, as originally studied by Hall, Marron and
Neeman (2005) [? ] and Ahn et al. (2007) [? ]. Theoretical properties of
sparse PCA have been studied before under different asymptotic frame- 
works. Leng and Wang (2009) [? ] used the adaptive lasso penalty of Zou, 
Hastie and Tibshirani (2006) [? ] to introduce sparse loadings, and established
a consistency result for selecting non-zero loadings when the sample
size $n \to \infty$, with the dimension $d$ fixed. Johnstone and Lu (2009) [? ]
considered a single-component spiked covariance model (originally proposed
by Johnstone (2001) [? ]) and showed that conventional PCA is consistent if
and only if $d(n)/n \to 0$; furthermore, under the condition $\log(d \vee n)/n \to 0$,
they proved consistency of PCA performed on a subset of variables with
largest sample variance. Amini and Wainwright (2009) [? ] considered the
same single-component spiked model, and further restricted the maximal 
eigenvector to have k non-zero entries; they studied the thresholding subset 
PCA procedure of Johnstone and Lu [? ] and the sparsePCA procedure of 
d'Aspremont et al. (2007) [? ], and explored conditions on the triplet (n, d, k) 
under which each procedure can recover the support set of the sparse eigen- 
vector with probability one. Paul and Johnstone [? ] developed the aug- 
mented sparse PCA procedure along with its optimal rate of convergence 
property. Ma [? ] proposed an iterative thresholding procedure for estimating 
principal subspaces that has nice theoretical properties. 

Sparse PCA is primarily motivated by modern data sets of very high di- 
mension; hence we prefer the statistical viewpoint of the High Dimension 
Low Sample Size (HDLSS) asymptotics. Note that this case of $d \to \infty$ with
n fixed was not considered by Johnstone and Lu [? ]. Conventional PCA was 
first studied using HDLSS asymptotics by Ahn et al. [? ] and the most com- 
prehensive current result is Jung and Marron (2009) [? ]. The latter found 
conditions when the first several empirical PC directions would be consis- 
tent or subspace consistent with the corresponding population PC directions. 
This happens when the first several eigenvalues are large enough, compared 
with the rest of the eigenvalues of the population covariance matrix. More- 
over, if the first few eigenvalues are not sufficiently large, all empirical PC 
directions will be strongly inconsistent with their population counterparts 
in the sense that the angle between them will converge to 90 degrees. 

The main contribution of this paper is an exploration of conditions where 
conventional PCA is strongly inconsistent (for scenarios with relatively small 
population eigenvalues), yet sparse PCA methods are consistent. Further- 
more, the mathematical boundaries of the sparse PCA consistency are estab- 
lished by showing strong inconsistency, for even an oracle version of sparse 
PCA, beyond the consistent region. Similar to Johnstone and Lu (2009) [? ]
and Amini and Wainwright (2009) [? ], we focus on the single component
spiked covariance model. Our results depend on a spike index, $\alpha$, defined
below in the context of Example 1.1, which measures the dominance of the
first eigenvalue, and on a sparsity index, $\beta$, defined also in Example 1.1,
which measures the number of non-zero entries of the first population
eigenvector. For illustration purposes, we simplify the consistency and strong
inconsistency results for the exemplary model considered in Example 1.1,
and summarize them below as functions of $\alpha$ and $\beta$ in Figure 1:

• Previous Results (dark grey rectangle): Jung and Marron (2009)
[? ] showed that the first empirical eigenvector is consistent with the
first population eigenvector when the spike index $\alpha$ is greater than 1.

• Consistency (white triangle): We will show that sparse PCA is
consistent even when the spike index $\alpha$ is less than 1, as long as $\alpha$
is greater than the sparsity index $\beta$. This is done in Section 2 for a
simple thresholding method and in Section 3 for the RSPCA method
proposed by Shen and Huang (2008) [? ].

• Strong Inconsistency (black triangle): In Section 4 we show that
even an oracle sparse PCA procedure is strongly inconsistent with the
first population eigenvector, when the spike index $\alpha$ is smaller than
the sparsity index $\beta$.

• Irrelevant Area (light grey rectangle): The sparsity index $\beta$ cannot
be larger than 1, hence the light grey rectangular area is irrelevant.

1.1. Notation and Assumptions. All quantities are indexed by the
dimension $d$ in this paper. However, when it will not lead to confusion, the
subscript $d$ will be omitted for convenience. Let the population covariance
matrix be $\Sigma_d$. The eigen-decomposition of $\Sigma_d$ is

$$\Sigma_d = U_d \Lambda_d U_d^T,$$

where $\Lambda_d$ is the diagonal matrix of the population eigenvalues
$\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$ and $U_d$ is the matrix of corresponding
population eigenvectors so that $U_d = [u_1, \ldots, u_d]$.

Assume that $X_1, \ldots, X_n$ are random samples from a $d$-dimensional normal
distribution $N(0, \Sigma_d)$. Denote the data matrix by $X_{(d)} = [X_1, \ldots, X_n]_{d \times n}$
and the sample covariance matrix by $\hat{\Sigma}_d = n^{-1} X_{(d)} X_{(d)}^T$. Then, the sample
covariance matrix can be similarly decomposed as

$$\hat{\Sigma}_d = \hat{U}_d \hat{\Lambda}_d \hat{U}_d^T,$$

where $\hat{\Lambda}_d$ is the diagonal matrix of the sample eigenvalues
$\hat{\lambda}_1 \geq \hat{\lambda}_2 \geq \cdots \geq \hat{\lambda}_d$ and $\hat{U}_d$ is the matrix of the corresponding
sample eigenvectors so that $\hat{U}_d = [\hat{u}_1, \ldots, \hat{u}_d]$.
Fig 1. Consistent areas for PCA and sparse PCA, as a function of the spike index $\alpha$ and
the sparsity index $\beta$, under the single component spiked model considered in Example 1.1.
Conventional PCA is consistent only on the dark grey rectangle ($\alpha > 1$), while sparse
PCA is also consistent on the white triangle ($0 \leq \beta < \alpha \leq 1$). In addition, an oracle
sparse PCA procedure is strongly inconsistent on the black triangle ($0 \leq \alpha < \beta \leq 1$). The
light grey rectangular area ($0 \leq \alpha \leq 1$, $\beta > 1$) is not considered because the sparsity index
$\beta \leq 1$. The dots show the grid points studied in the simulation study of Section 5.



Let $\check{u}_i$ be any sample based estimator of $u_i$, e.g. $\check{u}_i = \hat{u}_i$, for $i = 1, \ldots, d$.
Two important concepts from Jung and Marron (2009) [? ] are:

• Consistency: The direction $\check{u}_i$ is consistent with its population
counterpart $u_i$ if

(1.1) $\mathrm{Angle}(\check{u}_i, u_i) = \arccos(|\langle \check{u}_i, u_i \rangle|) \xrightarrow{p} 0$, as $d \to \infty$,

where $\langle \cdot, \cdot \rangle$ denotes the inner product between two vectors.

• Strong Inconsistency: The direction $\check{u}_i$ is strongly inconsistent with
its population counterpart $u_i$ if

$\mathrm{Angle}(\check{u}_i, u_i) = \arccos(|\langle \check{u}_i, u_i \rangle|) \xrightarrow{p} \frac{\pi}{2}$, as $d \to \infty$.
In addition, we consider another important concept in the current paper: 
• Consistency with convergence rate $d^{\iota}$: The direction $\check{u}_i$ is consistent
with its population counterpart $u_i$ with the convergence rate $d^{\iota}$
if $|\langle \check{u}_i, u_i \rangle| = 1 + o_p(d^{-\iota})$, where the notation $G_d = o_p(d^{-\iota})$ means
that $d^{\iota} G_d \xrightarrow{p} 0$, as $d \to \infty$.
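
The angle in (1.1) is the quantity reported throughout the simulations, so a small numerical sketch may help. The function below is illustrative only; the name `angle_degrees` is an assumption and does not come from the paper. It normalizes both inputs, takes the absolute inner product (so that a direction and its negative are treated as the same), and returns the angle in degrees.

```python
import numpy as np

def angle_degrees(u_est, u_pop):
    """Angle (in degrees) between two directions, ignoring sign, as in (1.1)."""
    u_est = np.asarray(u_est, dtype=float)
    u_pop = np.asarray(u_pop, dtype=float)
    # Absolute cosine between the two directions after normalization.
    cos = abs(u_est @ u_pop) / (np.linalg.norm(u_est) * np.linalg.norm(u_pop))
    # Clip guards against rounding slightly above 1 before arccos.
    return float(np.degrees(np.arccos(np.clip(cos, 0.0, 1.0))))
```

Under this convention, consistency corresponds to the angle tending to 0 degrees and strong inconsistency to the angle tending to 90 degrees.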

Example 1.1. Assume that $X_1, \ldots, X_n$ are random sample vectors from
a $d$-dimensional normal distribution $N(0, \Sigma_d)$, where the covariance matrix
$\Sigma_d$ has the eigenvalues

$$\lambda_1 = d^{\alpha}, \quad \lambda_2 = \cdots = \lambda_d = 1, \quad \alpha \geq 0.$$

This is a special case of the single component spike covariance Gaussian
model considered before by, for example, Johnstone (2001) [? ], Paul (2007)
[? ], Johnstone and Lu (2009) [? ], Amini and Wainwright (2009) [? ]. Without
loss of generality (WLOG), we further assume that the first eigenvector
$u_1$ is proportional to the following $d$-dimensional vector

$$\mathring{u}_1 = (\underbrace{1, \ldots, 1}_{\lfloor d^{\beta} \rfloor}, 0, \ldots, 0)^T,$$

where $0 \leq \beta \leq 1$ and $\lfloor d^{\beta} \rfloor$ denotes the integer part of $d^{\beta}$. (In general the
non-zero entries do not have to be the first $\lfloor d^{\beta} \rfloor$ elements, neither do they
need to be equal.) If $\beta = 0$, the first population eigenvector becomes
$u_1 = (1, 0, \ldots, 0)^T$.

For the above model, Jung and Marron (2009) [? ] showed that the first
empirical eigenvector (the PC direction) $\hat{u}_1$ is consistent with $u_1$ when $\alpha > 1$;
however for $\alpha < 1$, it is strongly inconsistent. Again, the main point of the
current paper is an exploration of conditions under which sparse methods
can lead to consistency when the spike index $\alpha < 1$ (recall that the first
eigenvalue $\lambda_1 = d^{\alpha}$), by exploiting sparsity. Sparsity is quantified by the
sparsity index $\beta$, where $\lfloor d^{\beta} \rfloor$ is the number of non-zero elements of the first
eigenvector $u_1$. Here we use the above simple example for intuitive illustration
purposes, to highlight the key findings. More general single component
spike models will be considered in Sections 2 to 4.
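
To make Example 1.1 concrete, the following sketch draws a data matrix from that model. It is an illustration under stated assumptions: the function name `example_1_1_sample` is hypothetical, and the sampling shortcut is a choice made here rather than something described in the paper. Because the covariance has the form $I_d + (d^{\alpha} - 1) u_1 u_1^T$, a standard normal matrix $W$ can be turned into a sample by rescaling only its component along $u_1$, which avoids forming a $d \times d$ matrix.

```python
import numpy as np

def example_1_1_sample(d, n, alpha, beta, rng):
    """Draw a d-by-n data matrix from the spiked model of Example 1.1.

    Returns (X, u1), where u1 is the unit first population eigenvector.
    """
    # Number of non-zero loadings, floor(d^beta); the small offset guards
    # against floating-point results such as 9.999999 for an exact 10.
    k = int(np.floor(d ** beta + 1e-9))
    u1 = np.zeros(d)
    u1[:k] = 1.0 / np.sqrt(k)
    W = rng.standard_normal((d, n))   # columns are iid N(0, I_d)
    s = np.sqrt(d ** alpha)           # square root of the first eigenvalue
    # (I + (s - 1) u1 u1^T) W has covariance I + (s^2 - 1) u1 u1^T.
    X = W + (s - 1.0) * np.outer(u1, u1 @ W)
    return X, u1
```

A call such as `example_1_1_sample(10_000, 25, 0.6, 0.1, np.random.default_rng(0))` would mimic the dimensions used later in the simulation section.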

1.2. Roadmap of the paper. The rest of the paper is organized as
follows. For easy access to the main ideas, Section 2 first introduces a simple
thresholding method to generate sparse PC directions. Section 2.1 shows the 
consistency of the sparse PC directions, obtained by this simple threshold- 
ing method. Section 3 then generalizes these ideas to a current sparse PCA 
method. In particular, we consider the sparse PCA method developed by 






DAN SHEN, HAIPENG SHEN AND J. S. MARRON 



Shen and Huang (2008) [? ], and build its connection to the simple thresh- 
olding method. We then establish the consistency of the sparse PCA method 
under the sparsity and small spike conditions where the conventional PCA is 
strongly inconsistent. Section 4 considers scenarios when the spike index $\alpha$ is
smaller than the sparsity index $\beta$, and proves the strong inconsistency of an
appropriate oracle PCA procedure. Section 5 reports some simulation results 
to illustrate both consistency and strong inconsistency of PCA and sparse 
PCA. Section 6 concludes the paper with some discussion of future work 
on extending consistency of sparse PCA to more general distributions. We 
point out that it is challenging to move beyond Gaussianity to get HDLSS 
consistency of sparse PCA. Section 7 contains the proofs of the theorems. 

2. Consistency of a simple thresholding method for sparse PCA
in HDLSS. In Example 1.1, the first eigenvector of the sample covariance
matrix $\hat{u}_1$ is strongly inconsistent with $u_1$ when $\alpha < 1$, because it attempts
to estimate too many parameters. Sparse data analytic methods assume
that many of these parameters are zero, which can allow greatly improved
estimation of the first PC direction $u_1$. Here, this issue is explored in the
context of sparse PCA, where $u_1 = (1, 0, \ldots, 0)^T$ is an extreme case. The
sample covariance matrix based estimator, $\hat{u}_1$, can be improved by exploiting
the fact that $u_1$ has many zero elements.

A natural approach is a simple thresholding method where entries with
small absolute values are replaced by zero. In HDLSS contexts, it is challenging
to apply thresholding directly to the entries of $\hat{u}_1$, because the number
of them grows rapidly as $d \to \infty$, which naturally shrinks their magnitudes
given that $\hat{u}_1$ is a unit vector. Thresholding is more conveniently formulated
in terms of the dual covariance matrix as used by Jung, Sen and Marron
(2010) [? ].

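Denote the dual sample covariance matrix by $S_d = n^{-1} X_{(d)}^T X_{(d)}$ and the
first dual eigenvector by $\tilde{v}_1$. The sample eigenvector $\hat{u}_1$ is connected with
the dual eigenvector $\tilde{v}_1$ through the following transformation,

(2.1) $\tilde{u}_1 = (\tilde{u}_{1,1}, \ldots, \tilde{u}_{d,1})^T = X_{(d)} \tilde{v}_1$,

and the sample estimate is then given by $\hat{u}_1 = \tilde{u}_1 / \|\tilde{u}_1\|$ [? ].

Given a sequence of threshold values $\lambda$, define the thresholded entries as

(2.2) $\breve{u}_{i,1} = \begin{cases} \tilde{u}_{i,1} & \text{if } |\tilde{u}_{i,1}| > \lambda, \\ 0 & \text{if } |\tilde{u}_{i,1}| \leq \lambda, \end{cases}$ for $i = 1, \ldots, d$.

Denote $\breve{u}_1 = (\breve{u}_{1,1}, \ldots, \breve{u}_{d,1})^T$ and normalize it to get the simple thresholding
(ST) estimator $\hat{u}_1^{ST} = \breve{u}_1 / \|\breve{u}_1\|$.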
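
The steps (2.1) and (2.2) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `simple_thresholding` is hypothetical, and raising an error when every entry is zeroed out is a design choice made here for clarity.

```python
import numpy as np

def simple_thresholding(X, lam):
    """Hypothetical helper: ST estimate of the first PC direction.

    X   : d-by-n data matrix (columns are observations).
    lam : non-negative threshold applied to the entries of X @ v1.
    """
    n = X.shape[1]
    S_dual = X.T @ X / n                    # n-by-n dual sample covariance
    _, eigvecs = np.linalg.eigh(S_dual)     # eigenvalues in ascending order
    v1 = eigvecs[:, -1]                     # first dual eigenvector
    u_tilde = X @ v1                        # unnormalized direction, as in (2.1)
    u_breve = np.where(np.abs(u_tilde) > lam, u_tilde, 0.0)  # rule (2.2)
    norm = np.linalg.norm(u_breve)
    if norm == 0.0:
        raise ValueError("threshold is too large: every entry was set to zero")
    return u_breve / norm
```

Working with the $n \times n$ dual matrix keeps the eigen-decomposition cheap when $n$ is small and $d$ is large, which is the HDLSS setting of interest.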



CONSISTENCY OF SPARSE PCA IN HDLSS 






For the model considered in Example 1.1, given an eigenvalue of strength
$\alpha \in (0, 1)$ (recall $\lambda_1 = d^{\alpha}$ and $\hat{u}_1$ is strongly inconsistent), below we explore
conditions on the threshold sequence $\lambda$ under which the ST estimator $\hat{u}_1^{ST}$ is
in fact consistent with $u_1$. First of all, the threshold $\lambda$ cannot be too large;
otherwise all the entries will be zeroed out. It will be seen in Theorem 2.1
that a sufficient condition for this is $\lambda \leq d^{\gamma/2}$, where $\gamma \in (0, \alpha)$. Secondly, the
threshold $\lambda$ cannot be too small, or pure noise terms will be included. A
parallel sufficient condition is shown to be $\lambda \geq \log^{\delta}(d)$, where $\delta \in (\frac{1}{2}, \infty)$.
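
As a rough numerical illustration of these two bounds, the sketch below evaluates them for a given dimension. It rests on assumptions: the exponents are read as $\log^{\delta}(d)\, d^{\theta/2}$ for the lower bound and $d^{\gamma/2}$ for the upper bound, the logarithm is taken to be natural, and the function name `threshold_range` is hypothetical.

```python
import math

def threshold_range(d, delta, gamma, theta=0.0):
    """Return (lower, upper) bounds on the threshold for dimension d.

    lower = log(d)^delta * d^(theta/2), upper = d^(gamma/2).
    theta = 0 corresponds to Example 1.1, where lambda_2 = 1.
    """
    lower = math.log(d) ** delta * d ** (theta / 2.0)
    upper = d ** (gamma / 2.0)
    return lower, upper
```

For a finite $d$ the interval can be empty for some parameter choices; the bounds are asymptotic statements and are only guaranteed to separate as $d$ grows.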

2.1. Consistency of the simple thresholding method. Below we formally
establish conditions on the eigenvalues of the population covariance matrix
$\Sigma_d$ and the thresholding parameter $\lambda$, which give consistency of $\hat{u}_1^{ST}$ to
$u_1$. All the technical proofs are provided in Section 7 and the supplementary
materials.

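We begin by considering the extreme sparsity case $u_1 = (1, 0, \ldots, 0)^T$.
Suppose that $\lambda_1 \sim d^{\alpha}$, in the sense that
$0 < c_1 \leq \liminf_{d \to \infty} \lambda_1 / d^{\alpha} \leq \limsup_{d \to \infty} \lambda_1 / d^{\alpha} \leq c_2$,
where $c_1$ and $c_2$ are two constants. Similarly, assume $\sum_{i=2}^{d} \lambda_i \sim d$. As in
Jung and Marron [? ], denote the measure of sphericity as

$$\epsilon = \frac{\mathrm{tr}^2(\Sigma_d)}{d\, \mathrm{tr}(\Sigma_d^2)} = \frac{\left(\sum_{i=1}^{d} \lambda_i\right)^2}{d \sum_{i=1}^{d} \lambda_i^2},$$

and assume the $\epsilon$-condition: $\epsilon \gg \frac{1}{d}$, i.e.

(2.3) $(d\epsilon)^{-1} = \dfrac{\sum_{i=1}^{d} \lambda_i^2}{\left(\sum_{i=1}^{d} \lambda_i\right)^2} \to 0$, as $d \to \infty$.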

Now we need to impose the following conditions on the eigenvalues: 

• Assume that lim^oo ^ = c, where c is a non- negative constant, 

and the $\epsilon$-condition is satisfied. These conditions can guarantee that
the dual matrix $S_d$ has a limit. Hence the first dual eigenvector $\tilde{v}_1$ will
have a limit and it will then help build up the consistency of $\hat{u}_1^{ST}$.

• In addition, we need the second eigenvalue $\lambda_2$ to be an obvious distance
away from the first eigenvalue $\lambda_1$. If not, it will be hard to distinguish
the first and second empirical eigenvectors as observed by Jung and
Marron, among others. In that case the appropriate amount of thresholding
on the first empirical eigenvector becomes unclear. Therefore,
we assume that $\lambda_2 \sim d^{\theta}$, where $\theta < \alpha$.

Theorem 2.1. Suppose that $X_1, \ldots, X_n$ are random samples from a
$d$-dimensional normal distribution $N(0, \Sigma_d)$ and the first population
eigenvector $u_1 = (1, 0, \ldots, 0)^T$. If the following conditions are satisfied:
(a) $\lambda_1 \sim d^{\alpha}$, $\lambda_2 \sim d^{\theta}$, and $\sum_{i=2}^{d} \lambda_i \sim d$, where $\theta \in [0, \alpha)$ and $\alpha \in (0, 1]$,

(b) for a non-negative constant c, lim^oo ^d 1 = c an d the e-condition (2.3) 
is satisfied, 

(c) $\log^{\delta}(d)\, d^{\theta/2} \leq \lambda \leq d^{\gamma/2}$, where $\delta \in (\frac{1}{2}, \infty)$ and $\gamma \in (\theta, \alpha)$,
then the simple thresholding estimator $\hat{u}_1^{ST}$ is consistent with $u_1$.

In fact, $u_1 = (1, 0, \ldots, 0)^T$ in Theorem 2.1 is a very extreme case. The
following theorem considers the general case $u_1 = (u_{1,1}, \ldots, u_{d,1})^T$, where
only $\lfloor d^{\beta} \rfloor$ elements of $u_1$ are non-zero. WLOG, we assume that the first $\lfloor d^{\beta} \rfloor$
entries are non-zero just for notational convenience.

Define

(2.4) $Z_j = (z_{1,j}, \ldots, z_{d,j})^T = (X_j^T u_1, \ldots, X_j^T u_d)^T$, $j = 1, \ldots, n$.

We can show that the $Z_j$ are iid $N(0, \mathrm{diag}\{\lambda_1, \ldots, \lambda_d\})$ random vectors. In
addition, let

(2.5) $W_j = (w_{1,j}, \ldots, w_{d,j})^T = (\lambda_1^{-1/2} z_{1,j}, \ldots, \lambda_d^{-1/2} z_{d,j})^T$, $j = 1, \ldots, n$,

and the $W_j$ are iid $N(0, I_d)$ random vectors, where $I_d$ is the $d$-dimensional
identity matrix.

The following additional conditions are needed to ensure the consistency
of $\hat{u}_1^{ST}$:

• The non-zero entries of the population eigenvector $u_1$ need to be a
certain distance away from zero. In fact, if the non-zero entries of the
first population eigenvector are close to zero, the corresponding entries
of the first empirical eigenvector would also be small and look like pure
noise entries. Thus, we assume

$\max_{1 \leq i \leq \lfloor d^{\beta} \rfloor} |u_{i,1}|^{-1} \sim d^{\eta/2}$, where $\eta \in [0, \alpha)$.

• From (2.4), we have

$$X_j = \sum_{i=1}^{d} z_{i,j} u_i, \quad j = 1, \ldots, n.$$

Since $z_{1,j}$ has the largest variance $\lambda_1$, the term $z_{1,j} u_1$ contributes the most
to the variance of $X_j$, $j = 1, \ldots, n$. Note that $z_{1,j} u_1$ is consistent with
$u_1$, and so $z_{1,j} u_1$ is the key to making the simple thresholding method
work. So we need to show that the remaining parts

(2.6) $H_j = (h_{1,j}, \ldots, h_{d,j})^T = \sum_{i=2}^{d} z_{i,j} u_i$, $j = 1, \ldots, n$,

have a negligible effect on the estimated direction vector.
• Suppose that the $H_j$ are iid $N(0, \Delta_d)$, where $\Delta_d = (m_{kl})_{d \times d}$, for
$j = 1, \ldots, n$. A sufficient condition to make their effect negligible is
the following mixing condition of Leadbetter, Lindgren and Rootzen
(1983) [? ]:

(2.7) $|m_{kl}| \leq m_{kk}^{1/2} m_{ll}^{1/2} \rho_{|k-l|}$, $1 \leq k \neq l \leq \lfloor d^{\beta} \rfloor$,

where $\rho_t < 1$ for all $t \geq 1$ and $\rho_t \log(t) \to 0$, as $t \to \infty$. This mixing
condition can guarantee that the maximum of the $|h_{i,j}|$ has a quick
convergence rate, as $d \to \infty$. It enables us to neglect the influence of $H_j$ for
sufficiently large $d$ and make $z_{1,j} u_1$ the dominant component, which
then gives consistency to the first population eigenvector $u_1$. Thus the
thresholding estimator $\hat{u}_1^{ST}$ becomes consistent.

We now state one of the main theorems: 

Theorem 2.2. Assume that $X_1, \ldots, X_n$ are random samples from a
$d$-dimensional normal distribution $N(0, \Sigma_d)$. Define $Z_j$, $W_j$ and $H_j$ as in (2.4),
(2.5), and (2.6) for $j = 1, \ldots, n$. The first population eigenvector is
$u_1 = (u_{1,1}, \ldots, u_{d,1})^T$ with $u_{i,1} \neq 0$, $i = 1, \ldots, \lfloor d^{\beta} \rfloor$, and otherwise $u_{i,1} = 0$.

If the following conditions are satisfied:

(a) $\lambda_1 \sim d^{\alpha}$, $\lambda_2 \sim d^{\theta}$, and $\sum_{i=2}^{d} \lambda_i \sim d$, where $\theta \in [0, \alpha)$ and $\alpha \in (0, 1]$,

(b) for a non-negative constant C, lim^oo „ d Al = C and e-condition 

(2.3) is satisfied, 

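(c) $\max_{1 \leq i \leq \lfloor d^{\beta} \rfloor} |u_{i,1}|^{-1} \sim d^{\eta/2}$, where $\eta \in [0, \alpha)$,

(d) $H_j$ satisfies the mixing condition (2.7), $j = 1, \ldots, n$,

(e) $\log^{\delta}(d)\, d^{\theta/2} \leq \lambda \leq d^{\gamma/2}$, where $\delta \in (\frac{1}{2}, \infty)$ and $\gamma \in (\theta, \alpha - \eta)$,

then the thresholding estimator $\hat{u}_1^{ST}$ is consistent with $u_1$.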

We offer a couple of remarks regarding the above theorem. First of all,
the theorem naturally reduces to Theorem 2.1 if we let the sparsity index
$\beta = 0$. More importantly, this theorem, and the following ones in Sections 2
to 4, show that the concepts depicted in Figure 1 hold much more generally
than just under the conditions of Example 1.1. In particular, in the above
Theorem 2.2, setting $\theta = 0$ and $\eta = \beta$ would give the results plotted in
Figure 1.

In addition, for different thresholding parameters $\lambda$, the ST estimator $\hat{u}_1^{ST}$
is consistent with $u_1$ with different convergence rates. This result is stated in
the following theorem. The notation $\lambda = o(d^{\rho})$ below means that $\lambda d^{-\rho} \to 0$
as $d \to \infty$.
Theorem 2.3. For the thresholding parameter $\lambda = o(d^{(\alpha - \eta - \varsigma)/2})$, where
$\varsigma \in [0, \alpha - \eta - \theta)$, the corresponding thresholding estimator $\hat{u}_1^{ST}$ is consistent
with $u_1$, with a convergence rate of $d^{\varsigma/2}$.

3. Asymptotic properties of RSPCA. As noted in Section 1, several
sparse PCA methods have been proposed in the literature. Here we perform a
detailed HDLSS asymptotic analysis of the sparse PCA procedure developed
by Shen and Huang (2008) [? ]. For simplicity, we refer to it as the regularized
sparse PCA, or RSPCA for short. All the detailed technical proofs are again
provided in Section 7 and the supplementary materials.

We start by briefly reviewing the methodological details of RSPCA.
(For more details, see [? ].) Given a $d$-by-$n$ data matrix $X_{(d)}$, consider the
following penalized sum-of-squares criterion:

(3.1) $\|X_{(d)} - u v^T\|_F^2 + P_{\lambda}(u)$, subject to $\|v\| = 1$,

where $u$ is a $d$-vector, $v$ is a unit $n$-vector, $\|\cdot\|_F$ denotes the Frobenius
norm, and $P_{\lambda}(u) = \sum_{i=1}^{d} p_{\lambda}(|u_i|)$ is a penalty function with $\lambda \geq 0$ being
the penalty parameter. The penalty function can be any sparsity-inducing
penalty. In particular, Shen and Huang [? ] considered the soft thresholding
(or $L_1$ or LASSO) penalty of Tibshirani (1996) [? ], the hard thresholding
penalty of Donoho and Johnstone (1994) [? ], and the smoothly clipped
absolute deviation (SCAD) penalty of Fan and Li (2001) [? ].

Without the penalty term or when $\lambda = 0$, minimization of (3.1) can be
obtained via singular value decomposition (SVD) [? ], which results in the
best rank-one approximation of $X_{(d)}$ as $\tilde{u}_1 \tilde{v}_1^T$, where $\tilde{u}_1$ and $\tilde{v}_1$ minimize
the criterion (3.1). The normalized $\tilde{u}_1$ turns out to be the first empirical PC
loading vector. With the penalty term, Shen and Huang define the sparse PC
loading vector as $\hat{u}_1 = \tilde{u}_1 / \|\tilde{u}_1\|$ where $\tilde{u}_1$ is now the minimizer of (3.1) with
the penalty term included. The minimization now needs to be performed
iteratively. For a given $\tilde{v}_1$ in the criterion (3.1), we can get the minimizing
vector as $\tilde{u}_1 = h_{\lambda}(X_{(d)} \tilde{v}_1)$, where $h_{\lambda}$ is a thresholding function that depends
on the particular penalty function used and the penalty (or thresholding)
parameter $\lambda$. See [? ] for more details. The thresholding is applied to the
vector $X_{(d)} \tilde{v}_1$ componentwise.

Shen and Huang (2008) [? ] proposed the following iterative procedure
for minimizing the criterion (3.1):

The RSPCA Algorithm

1. Initialize:
(a) Use SVD to obtain the best rank-one approximation $\tilde{u}_1 \tilde{v}_1^T$ of the
data matrix $X_{(d)}$, where $\tilde{v}_1$ is a unit vector.
(b) Set $\tilde{u}_1^{old} = \tilde{u}_1$ and $\tilde{v}_1^{old} = \tilde{v}_1$.

2. Update:
(a) $\tilde{u}_1^{new} = h_{\lambda}(X_{(d)} \tilde{v}_1^{old})$.
(b) $\tilde{v}_1^{new} = \dfrac{X_{(d)}^T \tilde{u}_1^{new}}{\|X_{(d)}^T \tilde{u}_1^{new}\|}$.

3. Repeat Step 2 setting $\tilde{u}_1^{old} = \tilde{u}_1^{new}$ and $\tilde{v}_1^{old} = \tilde{v}_1^{new}$ until convergence.

4. Normalize the final $\tilde{u}_1^{new}$ to get $\hat{u}_1$, the desired sparse loading vector.
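
The four steps above can be sketched as follows for the hard thresholding case. This is an illustrative sketch rather than the reference implementation of Shen and Huang: the function names, the stopping rule (relative change in the unnormalized loading vector), and the defaults `max_iter` and `tol` are all assumptions, and the threshold is applied directly to the entries of $X_{(d)} \tilde{v}_1$, so its scale may differ by a constant from the penalty parameter in (3.1).

```python
import numpy as np

def hard_threshold(y, lam):
    """Componentwise hard thresholding: keep entries with |y_i| > lam."""
    return np.where(np.abs(y) > lam, y, 0.0)

def rspca_hard(X, lam, max_iter=100, tol=1e-8):
    """Hypothetical sketch of the RSPCA iteration with hard thresholding.

    X is the d-by-n data matrix; returns a unit-norm sparse loading vector.
    """
    # Step 1: best rank-one approximation via SVD.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    v_old = Vt[0]                 # unit right singular vector
    u_old = s[0] * U[:, 0]        # unnormalized left factor
    for _ in range(max_iter):
        # Step 2(a): threshold X v componentwise.
        u_new = hard_threshold(X @ v_old, lam)
        w = X.T @ u_new
        w_norm = np.linalg.norm(w)
        if w_norm == 0.0:
            raise ValueError("threshold is too large: no loading survived")
        # Step 2(b): renormalize the updated v.
        v_new = w / w_norm
        # Step 3: stop once the loading vector no longer changes.
        change = np.linalg.norm(u_new - u_old)
        converged = change <= tol * max(1.0, np.linalg.norm(u_old))
        u_old, v_old = u_new, v_new
        if converged:
            break
    # Step 4: normalize the final loading vector.
    return u_old / np.linalg.norm(u_old)
```

With `lam = 0` the first update reproduces the SVD solution, so the function returns the conventional first PC direction (up to sign).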



There exists a nice connection between the simple thresholding (ST)
method of Section 2 and RSPCA. The ST estimator $\hat{u}_1^{ST}$ is exactly the
sparse loading vector $\hat{u}_1$ obtained from the first iteration of the RSPCA
iterative algorithm, when the hard thresholding penalty is used. In particular,
the first dual eigenvector $\tilde{v}_1$ in (2.1) is just the $\tilde{v}_1$ from the best rank-one
approximation $\tilde{u}_1 \tilde{v}_1^T$ of the data matrix $X_{(d)}$. Then, the application of the
simple thresholding method to the vector $X_{(d)} \tilde{v}_1$ as in (2.2) leads to the
sparse ST estimator $\hat{u}_1^{ST}$. This is the same as applying the hard thresholding
penalty in (3.1) to generate the sparse loading vector $\hat{u}_1$, for the given
$\tilde{v}_1$. The thresholding parameter $\lambda$ in (2.2) also corresponds to the penalty
parameter $\lambda$ in (3.1) in the case of the hard thresholding penalty.

Below we develop conditions under which the sparse RSPCA loading
vector $\hat{u}_1$ is consistent with the population eigenvector $u_1$ when a proper
thresholding parameter $\lambda$ is used. All three of the soft thresholding, hard
thresholding and SCAD penalties are considered. First, the following theorem
states conditions when the first step sparse loading vector $\hat{u}_1$ is consistent
with $u_1$ under the proper thresholding parameter $\lambda$.

Theorem 3.1. Under the assumptions and conditions of Theorem 2.2,
the first step sparse loading vector $\hat{u}_1$ is consistent with $u_1$.

Theorem 3.1 explores conditions when the first iteration of the iterative
procedure of RSPCA gives a consistent sparse loading vector $\hat{u}_1$, with an
appropriate thresholding parameter $\lambda$. Similar to the ST estimator, for
different parameters $\lambda$, $\hat{u}_1$ is consistent with $u_1$ with different convergence rates.
The result is given in the following Theorem 3.2.

Theorem 3.2. For the thresholding parameter $\lambda = o(d^{(\alpha - \eta - \varsigma)/2})$, where
$\varsigma \in [0, \alpha - \eta - \theta)$, the sparse loading vector $\hat{u}_1$ in Theorem 3.1 is consistent
with $u_1$, with a convergence rate of $d^{\varsigma/2}$.
We then set $\hat{u}_1^{old}$ to be the consistent sparse loading vector obtained after
the first iteration of the RSPCA algorithm, and obtain an updated
estimate for $\tilde{v}_1$ as $\tilde{v}_1^{new} = X_{(d)}^T \hat{u}_1^{old} / \|X_{(d)}^T \hat{u}_1^{old}\|$. The theorem below studies
the asymptotic properties of $\tilde{v}_1^{new}$.



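Theorem 3.3. Assume that $\hat{u}_1^{old}$ is consistent with $u_1$ with the convergence
rate $d^{\varsigma/2}$, where $\varsigma \in [1 - \alpha, \infty)$. If the $\epsilon$-condition is satisfied, then

$$\tilde{v}_1^{new} \xrightarrow{p} \frac{\tilde{W}_1}{\|\tilde{W}_1\|}, \quad \text{as } d \to \infty,$$

where $\tilde{W}_1 = (w_{1,1}, \ldots, w_{1,n})^T$ follows a standard $n$-dimensional normal
distribution $N(0, I_n)$ and the $w_{i,j}$ are defined in (2.5).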

Since Theorem 3.3 establishes the asymptotic properties of $\tilde{v}_1^{new}$, we can
now study the asymptotic properties of the updated sparse loading vector

(3.2) $\hat{u}_1^{new} = \dfrac{\tilde{u}_1^{new}}{\|\tilde{u}_1^{new}\|}$, with $\tilde{u}_1^{new} = h_{\lambda}(X_{(d)} \tilde{v}_1^{new})$,

as defined in the iterative procedure of RSPCA. The following Theorem 3.4
shows that with a proper choice of the thresholding parameter $\lambda$, the updated
sparse loading vector $\hat{u}_1^{new}$ remains consistent with the population
eigenvector $u_1$.

Theorem 3.4. Under the assumptions and conditions of Theorems 2.2
and 3.3, the updated sparse loading vector $\hat{u}_1^{new}$ in (3.2) is consistent with
$u_1$.

For different threshold parameters $\lambda$, $\hat{u}_1^{new}$ is again consistent with $u_1$ with
different convergence rates, as seen in the following theorem.

Theorem 3.5. For the thresholding parameter $\lambda = o(d^{(\alpha - \eta - \varsigma)/2})$, where
$\varsigma \in [0, \alpha - \eta - \theta)$, the updated sparse loading vector $\hat{u}_1^{new}$ in Theorem 3.4 is
consistent with $u_1$, with convergence rate $d^{\varsigma/2}$.

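According to Theorems 3.2 and 3.5, if $\alpha - \eta - \theta > 1 - \alpha$, then we can choose
the thresholding parameter $\lambda = o(d^{(\alpha - \eta - \varsigma)/2})$ with
$\varsigma \in [1 - \alpha, \alpha - \eta - \theta)$, so that the rate requirement of Theorem 3.3 is met
and the updated sparse loading vector $\hat{u}_1^{new}$ in (3.2) is consistent with $u_1$
at every updating step.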
4. Strong Inconsistency. We have shown that we can attain consistency
using sparse PCA, when the spike index $\alpha$ is greater than the sparsity
index $\beta$. This motivates the question of consistency using sparse PCA when
the spike index $\alpha$ is smaller than the sparsity index $\beta$. To answer this question,
we consider an oracle estimator which uses the exact positions of zero
entries of the population eigenvector $u_1$. We will show that even this oracle
estimator is strongly inconsistent with the population eigenvector $u_1$ when
the spike index $\alpha$ is smaller than the sparsity index $\beta$. Compared with this
oracle sparse PCA, threshold methods can perform no better because they
also need to estimate the location of the zero entries; hence threshold methods
will also be strongly inconsistent.

For Example 1.1, the first $\lfloor d^{\beta} \rfloor$ entries of the population eigenvector
$u_1$ are known to be the non-zero entries. So we could first find a
$\lfloor d^{\beta} \rfloor$-dimensional estimator $\hat{u}_1^{*}$ through subspace PCA for the
$\lfloor d^{\beta} \rfloor$-dimensional subspace eigenvector $u_1^{*}$, which is proportional to the
following $\lfloor d^{\beta} \rfloor$-dimensional vector,

$$\mathring{u}_1^{*} = (1, \ldots, 1)^T.$$

Then we get the oracle (OR) estimator for $u_1$ as

$$\hat{u}_1^{OR} = ((\hat{u}_1^{*})^T, 0, \ldots, 0)^T.$$

The oracle estimator $\hat{u}_1^{OR}$ has the same sparsity as the population
eigenvector $u_1$. Furthermore, it is strongly inconsistent with $u_1$ when $\alpha < \beta$.

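To make this precise, we study the procedure to generate the oracle
estimator for general single component models. Assume that the first $\lfloor d^{\beta} \rfloor$
entries of the population eigenvector $u_1$ are non-zero and the rest are all zero:
$u_1 = (u_{1,1}, \ldots, u_{d,1})^T$, where $u_{i,1} \neq 0$, $i = 1, \ldots, \lfloor d^{\beta} \rfloor$, otherwise $u_{i,1} = 0$.
Let $X_j^{*} = (x_{1,j}, \ldots, x_{\lfloor d^{\beta} \rfloor, j})^T \sim N(0, \Sigma^{*}_{\lfloor d^{\beta} \rfloor})$, where $\Sigma^{*}_{\lfloor d^{\beta} \rfloor}$ is the covariance
matrix of $X_j^{*}$, $j = 1, \ldots, n$. Then, the eigen-decomposition of $\Sigma^{*}_{\lfloor d^{\beta} \rfloor}$ is

$$\Sigma^{*}_{\lfloor d^{\beta} \rfloor} = U^{*}_{\lfloor d^{\beta} \rfloor} \Lambda^{*}_{\lfloor d^{\beta} \rfloor} (U^{*}_{\lfloor d^{\beta} \rfloor})^T,$$

where $\Lambda^{*}_{\lfloor d^{\beta} \rfloor}$ is the diagonal matrix of eigenvalues
$\lambda_1^{*} \geq \lambda_2^{*} \geq \cdots \geq \lambda^{*}_{\lfloor d^{\beta} \rfloor}$ and $U^{*}_{\lfloor d^{\beta} \rfloor}$ is the matrix of the
corresponding eigenvectors so that $U^{*}_{\lfloor d^{\beta} \rfloor} = [u_1^{*}, \ldots, u^{*}_{\lfloor d^{\beta} \rfloor}]$.

Since the last $d - \lfloor d^{\beta} \rfloor$ entries of the first population eigenvector
$u_1$ equal zero, it follows that the first eigenvector $u_1^{*}$ of $\Sigma^{*}_{\lfloor d^{\beta} \rfloor}$ is
formed by the non-zero entries of the population eigenvector $u_1$, i.e.
$u_1^{*} = (u_{1,1}, \ldots, u_{\lfloor d^{\beta} \rfloor, 1})^T$. So we have

$$u_1 = ((u_1^{*})^T, \underbrace{0, \ldots, 0}_{d - \lfloor d^{\beta} \rfloor})^T.$$

Consider the following data matrix $X^{*}_{(\lfloor d^{\beta} \rfloor)} = [X_1^{*}, \ldots, X_n^{*}]$, and denote
the sample covariance matrix by $\hat{\Sigma}^{*}_{\lfloor d^{\beta} \rfloor} = n^{-1} X^{*}_{(\lfloor d^{\beta} \rfloor)} (X^{*}_{(\lfloor d^{\beta} \rfloor)})^T$. Then, the sample
covariance matrix $\hat{\Sigma}^{*}_{\lfloor d^{\beta} \rfloor}$ can be similarly decomposed as

$$\hat{\Sigma}^{*}_{\lfloor d^{\beta} \rfloor} = \hat{U}^{*}_{\lfloor d^{\beta} \rfloor} \hat{\Lambda}^{*}_{\lfloor d^{\beta} \rfloor} (\hat{U}^{*}_{\lfloor d^{\beta} \rfloor})^T,$$

where $\hat{\Lambda}^{*}_{\lfloor d^{\beta} \rfloor}$ is the diagonal matrix of the sample eigenvalues
$\hat{\lambda}_1^{*} \geq \hat{\lambda}_2^{*} \geq \cdots \geq \hat{\lambda}^{*}_{\lfloor d^{\beta} \rfloor}$ and $\hat{U}^{*}_{\lfloor d^{\beta} \rfloor}$ is the matrix of the corresponding sample eigenvectors
so that $\hat{U}^{*}_{\lfloor d^{\beta} \rfloor} = [\hat{u}_1^{*}, \ldots, \hat{u}^{*}_{\lfloor d^{\beta} \rfloor}]$. Then, we define the oracle (OR) estimator as

(4.1) $\hat{u}_1^{OR} = ((\hat{u}_1^{*})^T, \underbrace{0, \ldots, 0}_{d - \lfloor d^{\beta} \rfloor})^T.$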
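
The oracle construction in (4.1) amounts to running PCA on the rows of the data matrix that correspond to the known non-zero loadings and padding the result with zeros. The sketch below illustrates this; the function name `oracle_estimator` and the `support` argument (the indices of the non-zero entries, assumed known) are hypothetical names introduced here.

```python
import numpy as np

def oracle_estimator(X, support):
    """Hypothetical sketch of the oracle (OR) estimator in (4.1).

    X       : d-by-n data matrix.
    support : indices of the entries of u_1 known to be non-zero.
    """
    d = X.shape[0]
    support = np.asarray(support)
    X_sub = X[support, :]                   # restrict to the known support
    # The first left singular vector of X_sub is the first eigenvector of
    # the subspace sample covariance n^{-1} X_sub X_sub^T.
    U, _, _ = np.linalg.svd(X_sub, full_matrices=False)
    u_or = np.zeros(d)
    u_or[support] = U[:, 0]                 # pad with zeros off the support
    return u_or
```

Unlike the thresholding estimators, this function never has to decide which entries are zero, which is why it serves as a benchmark that thresholding methods cannot beat.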

The following theorem states the main result regarding strong inconsistency.

Theorem 4.1. Assume that $X_1, \ldots, X_n$ are random samples from a
$d$-dimensional normal distribution $N(0, \Sigma_d)$. The first population eigenvector
is $u_1 = (u_{1,1}, \ldots, u_{d,1})^T$, where $u_{i,1} \neq 0$, $i = 1, \ldots, \lfloor d^{\beta} \rfloor$, otherwise $u_{i,1} = 0$.
If the following conditions are satisfied:

(a) Ai ~ d a , A 2 ~ d 9 , X d ~ 1 and £jL 2 A» ~ d, where 6 G [0, § ), 

(b) $\alpha < \beta$,

then the oracle estimator $\hat{u}_1^{OR}$ in (4.1) is strongly inconsistent with $u_1$.

5. Simulations for sparse PCA. Here, we will perform simulation
studies to illustrate the performance of the ST method and the RSPCA
with the hard thresholding penalty. Let the sample size $n = 25$ and the
dimension $d = 10{,}000$. To generate the data matrix $X_{(d)}$, we first need to
construct the population covariance matrix for $X_{(d)}$ that approximates the
conditions of Theorems 2.2 and 3.1 when the spike index $\alpha$ is greater than
the sparsity index $\beta$.

For the population covariance matrix, we consider the motivating model in
Example 1.1 for the first population eigenvector and the eigenvalues, where
the first eigenvalue $\lambda_1 = d^{\alpha}$ and the rest equal one, i.e. $\lambda_i = 1$, $i \geq 2$. For
the additional population eigenvectors $u_i$, $2 \leq i \leq \lfloor d^{\beta} \rfloor$, let the last $d - \lfloor d^{\beta} \rfloor$
entries of these eigenvectors be zero. In particular, let the eigenvectors $u_i$,
$2 \leq i \leq \lfloor d^{\beta} \rfloor$, be proportional to

$$\mathring{u}_i = (\underbrace{1, \ldots, 1}_{i-1}, -i + 1, 0, \ldots, 0)^T.$$

After normalizing $\mathring{u}_i$, we get the $i$-th eigenvector $u_i = \mathring{u}_i / \|\mathring{u}_i\|$. For $i > \lfloor d^{\beta} \rfloor$,
let the $i$-th eigenvector have just one non-zero entry in the $i$-th position such
that $u_i = (\underbrace{0, \ldots, 0}_{i-1}, 1, 0, \ldots, 0)^T$.

Then the data matrix is generated as

$$X_{(d)} = \sum_{i=1}^{d} \lambda_i^{1/2} u_i z_i^T,$$

where the $z_i$ are generated from the $n$-dimensional standard normal
distribution $N(0, I_n)$.
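
The eigenvector construction above can be checked numerically on a small example. The sketch below is illustrative only: the function names are hypothetical, the argument `k` plays the role of $\lfloor d^{\beta} \rfloor$, and the data matrix is formed under the assumption that it equals $\sum_i \lambda_i^{1/2} u_i z_i^T$ with independent standard normal $z_i$. It stores a full $d \times d$ matrix, so it is meant for small $d$, not for $d = 10{,}000$.

```python
import numpy as np

def simulation_eigenvectors(d, k):
    """Columns are the eigenvectors u_1, ..., u_d described in the text."""
    U = np.zeros((d, d))
    U[:k, 0] = 1.0                          # u_1 proportional to (1,...,1,0,...,0)
    for i in range(2, k + 1):               # 1-based index i = 2, ..., k
        U[: i - 1, i - 1] = 1.0             # i - 1 leading ones
        U[i - 1, i - 1] = -(i - 1)          # then the entry -i + 1
    for i in range(k + 1, d + 1):           # remaining coordinate vectors
        U[i - 1, i - 1] = 1.0
    return U / np.linalg.norm(U, axis=0)    # normalize every column

def generate_data(d, n, alpha, k, rng):
    """Assumed form of the simulated data: X = U diag(sqrt(lambda)) Z."""
    U = simulation_eigenvectors(d, k)
    lam = np.ones(d)
    lam[0] = d ** alpha                     # single spike
    Z = rng.standard_normal((d, n))         # row i holds z_i^T
    return U @ (np.sqrt(lam)[:, None] * Z), U[:, 0]
```

The leading $k$ columns form a set of orthogonal contrasts against the all-ones vector on the support, which is why the resulting matrix is orthonormal.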

We select twenty spike and sparsity pairs $(\alpha, \beta)$ that have spike index
$\alpha = \{0.2, 0.4, 0.6, 0.8\}$ and sparsity index $\beta = \{0, 0.1, 0.3, 0.5, 0.7\}$, which
are shown in Figure 1. We perform the simulation for all twenty spike and
sparsity pairs. For each spike and sparsity pair $(\alpha, \beta)$, we generate 100
realizations of the data matrix $X_{(d)}$. Results for three representative pairs are
reported below, and interesting observations are discussed. Additional
simulation results can be found at [? ].

First of all, the plots in Figure 2 summarize the results for the spike and
sparsity pair $(\alpha, \beta) = (0.6, 0.1)$, corresponding to one of the square dots in
the white (consistent) triangular area of Figure 1. For each replication of the
data matrix $X_{(d)}$ and a range of the thresholding parameter $\lambda$, we obtained
the ST estimator $\hat{u}_1^{ST}$ (Section 2) and the RSPCA estimator $\hat{u}_1$ (Section 4).
Then we calculate the angle between the estimates ($\hat{u}_1^{ST}$ or $\hat{u}_1$) and the first
population eigenvector $u_1$ through (1.1). Plotting this angle as a function
of the thresholding parameter $\lambda$ gives the curve in Panel (A) of Figure 2.
Since ST and RSPCA have very similar performance in this case, we just
show the RSPCA plots in Figure 2. The 100 simulation realizations of the
data matrices generate the one hundred curves in the panel. We rescale
the thresholding parameter $\lambda$ as $\log_{10}(\lambda + 10^{-5})$, to help reveal clearly the
tendency of the angle curves as the thresholding parameter increases.
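
One point on such an angle curve can be sketched as follows. Both helper names are hypothetical, and two assumptions are made because the corresponding definitions lie outside this section: the angle in (1.1) is taken to be $\arccos |\langle \hat{u}, u \rangle|$ for unit vectors, and the ST estimator is taken to hard-threshold the entries of $X_{(d)} \tilde{v}_1$ (with $\tilde{v}_1$ the first eigenvector of the $n \times n$ dual matrix) before normalizing.

```python
import numpy as np

def angle_degrees(u_hat, u):
    """Angle in degrees between an estimate and a target direction.

    A zero estimate is treated as orthogonal to everything (90 degrees).
    """
    norm = np.linalg.norm(u_hat)
    if norm == 0.0:
        return 90.0
    cos = abs(u_hat @ u) / (norm * np.linalg.norm(u))
    return float(np.degrees(np.arccos(min(cos, 1.0))))

def simple_thresholding(X, lam):
    """ASSUMED ST form: hard-threshold the entries of X v1, then normalize."""
    # v1 is the leading eigenvector of the n x n matrix X^T X (cheap when n << d).
    _, V = np.linalg.eigh(X.T @ X)        # eigenvalues come back in ascending order
    v1 = V[:, -1]
    scores = X @ v1
    scores[np.abs(scores) <= lam] = 0.0   # entries at or below the threshold are zeroed
    norm = np.linalg.norm(scores)
    return scores / norm if norm > 0 else scores

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))
u_pca = simple_thresholding(X, lam=0.0)   # no thresholding: conventional PCA direction
u_zero = simple_thresholding(X, lam=1e9)  # everything thresholded out
```

With `lam=0.0` the sketch returns the conventional PCA direction, and with a very large threshold it returns the zero vector, whose angle is reported as 90 degrees; these match the two ends of the angle curves described in the text.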

Fig 2. Performance summary of RSPCA for spike index $\alpha = 0.6$ and sparsity index
$\beta = 0.1$, where consistency is expected. Panel (A) shows the angle to the first population
eigenvector as a function of the thresholding parameter $\lambda$. Panels (B) and (C) show Type I Error
and Type II Error as functions of $\lambda$. The vertical dashed and solid lines are the left and
right bounds of the range of the thresholding parameter that leads to the consistency of
RSPCA. These show very good performance of RSPCA within the indicated range, which
empirically confirms our asymptotic calculation. The circles indicate values at the BIC
choice of $\lambda$.

In these angle plots, the angles with $\lambda = 0$ (essentially the left edge of
each plot) correspond to the ones obtained by conventional PCA. Note
that these angles are all over 40 degrees, which confirms the results of Jung
and Marron (2009) [? ] that when the spike index $\alpha < 1$, conventional
PCA cannot generate a consistent estimator for the population eigenvector
$u_1$. As $\lambda$ increases, the angle remains stable for a while, then decreases to
almost 0 degrees, before eventually starting to increase to 90 degrees. The
dashed and solid vertical lines in the angle plots indicate the range of the
thresholding parameter that gives a consistent estimator for $u_1$, as stated in
Theorems 2.2 and 3.4. These plots suggest that RSPCA does improve over
PCA and that the indicated thresholding range is very reasonable in this case,
which in turn empirically validates the asymptotic results of the theorems.
For each realization of the data, as the thresholding parameter increases, all
entries will eventually be thresholded out, i.e. become zero, so the sparse PCA estimator
eventually becomes a $d$-dimensional zero vector. Hence the angles go to 90
degrees when the thresholding parameter is large enough.

Zou, Hastie and Tibshirani (2007) [? ] suggest the use of the Bayesian
Information Criterion (BIC) [? ] to select the number of non-zero coefficients
for a lasso regression. Lee et al. (2010) [? ] apply this idea to the
sparse PCA context. According to [? ], for a fixed $v_1$, minimization of (3.1)
with respect to $u_1$ is equivalent to minimization of the following penalized
regression criterion with respect to $u_1$:

(5.1)    $\|X_{(d)} - u_1 v_1^T\|_F^2 + P_\lambda(u_1) = \|Y - (I_d \otimes v_1) u_1\|^2 + P_\lambda(u_1),$

where $Y = (X^1, \ldots, X^d)^T$, with $X^i$ being the $i$-th row of $X_{(d)}$, and $\otimes$ is
the Kronecker product. Following their suggestion, for the above penalized
regression (5.1) with a fixed $v_1$, we define

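(5.2)    $\mathrm{BIC}(\lambda) = \frac{\|Y - (I_d \otimes v_1)\hat{u}_1\|^2}{nd\,\hat{\sigma}^2} + \frac{\log(nd)}{nd}\, df(\lambda),$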

where $\hat{\sigma}^2$ is the ordinary least squares estimate of the error variance, and
$df(\lambda)$ is the degree of sparsity for the thresholding parameter $\lambda$, i.e. the
number of non-zero entries in $\hat{u}_1$. For every step of the iterative procedure of
RSPCA, we can use BIC (5.2) to select the thresholding parameter and then
obtain the corresponding sparse PC direction, until the algorithm converges.

For every angle curve in the angle plots of Figure 2, we use a blue circle to
indicate the thresholding parameter $\lambda$ that is selected by BIC during the last
iterative step of RSPCA, and the corresponding angle. In the current $\alpha = 0.6$,
$\beta = 0.1$ context, BIC works well, and all the BIC-selected $\lambda$ values are
very close, so the 100 circles are essentially overplotted on each other. BIC
also works well for the other spike and sparsity pairs $(\alpha, \beta)$ we considered
where $\alpha > \beta$, which are shown in [? ].

Another measure of the success of a sparse estimator is in terms of which
entries are zeroed. Type I Error is the proportion of non-zero entries in $u_1$
that are mistakenly estimated as zero. Type II Error is the proportion of
zero entries in $u_1$ that are mistakenly estimated as non-zero. Similar to the
angle, Type I Error (Type II Error) is also a function of the thresholding
parameter. For each replication of the data matrix $X_{(d)}$, we calculate a Type
I Error (Type II Error) curve. Thus, there are one hundred such curves in
each of Panels (B) and (C) of Figure 2. The dashed and solid lines in
these two panels are the same as those in Panel (A). Note that for all the
thresholding parameters in the range indicated by the lines, the errors are
very small, which is again consistent with the asymptotic results of Theorems
2.2 and 3.4. Again, the circles in these plots are selected by BIC and
they have the same horizontal coordinate (thresholding parameter) as in the angle plots.
Thus, BIC works well here. BIC also generates similarly very small errors
for the other spike and sparsity pairs $(\alpha, \beta)$ in Figure 1 that satisfy $\alpha > \beta$.
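
The two error rates defined above translate directly into code. The following is a minimal sketch; the function name `type_errors` is hypothetical and the small vectors are made up purely for illustration.

```python
import numpy as np

def type_errors(u_hat, u):
    """Return (Type I Error, Type II Error) of an estimate u_hat of u.

    Type I:  proportion of non-zero entries of u estimated as zero.
    Type II: proportion of zero entries of u estimated as non-zero.
    """
    true_nonzero = np.asarray(u) != 0
    est_nonzero = np.asarray(u_hat) != 0
    type1 = np.mean(~est_nonzero[true_nonzero]) if true_nonzero.any() else 0.0
    type2 = np.mean(est_nonzero[~true_nonzero]) if (~true_nonzero).any() else 0.0
    return float(type1), float(type2)

u = np.array([0.7, 0.7, 0.0, 0.0])             # toy "population" direction
mixed = type_errors(np.array([1.0, 0.0, 0.5, 0.0]), u)
dense = type_errors(np.array([0.5, 0.5, 0.5, 0.5]), u)   # nothing thresholded
empty = type_errors(np.zeros(4), u)                       # everything thresholded
```

A fully dense estimate has Type I Error zero and Type II Error one, while the zero vector has Type I Error one and Type II Error zero, which are the two limiting cases of the error curves.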

Next we compare the relative performance of PCA, ST and
RSPCA. In almost all cases, ST and RSPCA give better results than PCA,
and in some extreme cases the three methods perform similarly poorly.
Although ST and RSPCA perform similarly in most cases,
there are some cases (for example when $\alpha = 0.4$ and
$\beta = 0.3$) where RSPCA performs better than ST. For every replication of
the data matrix $X_{(d)}$, we use BIC to select the thresholding parameter, and
then calculate the ST estimator $\hat{u}_1^{ST}$ and the RSPCA estimator $\hat{u}_1$. After
that, we calculate the angle, Type I Error and Type II Error for the three
estimators, as well as the difference between ST and RSPCA (ST minus
RSPCA). For each measure, the 100 values are summarized using box plots
in Figure 3.



Fig 3. Comparison of PCA, ST and RSPCA for spike index $\alpha = 0.4$ and sparsity index
$\beta = 0.3$. Panels (A) Angle, (B) Type I Error and (C) Type II Error each contain four
box plots: (i) conventional PCA; (ii) and (iii) ST and RSPCA with BIC; (iv) the
difference between ST and RSPCA. In Panel (A), angles for conventional PCA are generally
larger than those for ST and RSPCA, which indicates the worse performance of conventional
PCA. In addition, the angles and Type I Errors for ST are larger than those for RSPCA, and their
difference box plots further confirm this point, which indicates the better performance
of RSPCA in this case. Type II Errors for ST and RSPCA are almost the same.

Panel (A) of Figure 3 shows the box plots of the angles between the
first population eigenvector $u_1$ and the estimates obtained by PCA, ST and
RSPCA, as well as the differences between ST and RSPCA. Note that the
PCA angles are large compared with those of ST and RSPCA, indicating the worse
performance of PCA. The angles for ST appear larger than those for RSPCA. For a
deeper view of this comparison, the pairwise differences are studied in the
fourth box plot of the panel. The angle differences are almost always positive,
with some differences bigger than 50 degrees, which suggests that RSPCA
performs better than ST. Similar conclusions can be drawn from
the box plots of the errors, in Panels (B) and (C) of Figure 3. The box plot
for PCA is not shown in Panel (C) because the corresponding Type II Error
almost always equals one, which is far outside the shown range of interest.

Finally, Theorems 2.2 and 3.4 consider the condition that the spike index
$\alpha$ is greater than the sparsity index $\beta$. When $\alpha$ is smaller than $\beta$, neither ST
nor RSPCA is expected to give consistent estimation for the first population
eigenvector $u_1$, as discussed in Section 4. For the spike and sparsity pairs
$(\alpha, \beta)$ such that $\alpha < \beta$, the simulation results also confirm this point. Here,
we display the simulation plots for the spike and sparsity pair $(\alpha, \beta) =
(0.2, 0.7)$ in Figure 4 as a representative of such simulations. Since ST and
RSPCA have very similar performance here, we just show the simulation
results for RSPCA. Similar to Figure 2, the circles in Figure 4 correspond to
the thresholding parameter selected by BIC. From the angle plots, we can see
that the angles selected by BIC are close to 90 degrees, which suggests the
failure of BIC in this case. In fact, all the angle curves are above 80 degrees.
Thus, neither ST nor RSPCA generates a reasonable sparse estimator. This
is a common phenomenon when the spike index $\alpha$ is smaller than the sparsity
index $\beta$. It is consistent with the theoretical investigation in Section 4.

Fig 4. Performance summary of RSPCA for spike index $\alpha = 0.2$ and sparsity index
$\beta = 0.7$, where strong inconsistency is expected. Panels: (A) Angle, (B) Type I Error,
(C) Type II Error, each plotted against $\log_{10}(\lambda + 10^{-5})$. Panel (A) shows the angle curves between
the RSPCA estimator and the first population eigenvector. Same format as Figure 2. Since
the spike index $\alpha = 0.2$ is smaller than the sparsity index $\beta = 0.7$, it follows that the right
bound (solid line) is smaller than the left bound (dashed line). Thus the theorems do not
give a meaningful range of the thresholding parameter. As expected, performance is very
poor for any thresholding parameter $\lambda$.

Furthermore, the corresponding Type I Error, generated by ST or RSPCA
with BIC, is close to one. This further confirms that BIC does not work when
the spike index $\alpha$ is smaller than the sparsity index $\beta$. ST and RSPCA
with $\lambda = 0$ are just conventional PCA, and typically will not generate
a sparse estimator. This entails that the Type I Error and Type II Error
corresponding to $\lambda = 0$ equal zero and one, respectively. As the thresholding
parameter increases, more and more entries are thresholded out; hence Type
I Error increases to one and Type II Error decreases to zero.

6. Non-Gaussian Variations. In this paper, we consider HDLSS data
analysis contexts, using the high dimensional normal distribution. In the
future, we hope to extend our theorems to more general distributions. However,
this will be challenging because sparse PCA methods may not work in some
extreme cases. This point is illustrated by the following interesting example.



Example 6.1. Let $\alpha \in (0, 1)$ and $X = (x_1, \ldots, x_d)^T$, where $\{x_i, i = 1, \ldots, d\}$
are independent discrete random variables with the following discrete
probability distributions:

$x_1 = d^{\alpha/2}$ with probability $\tfrac{1}{2}$, and $x_1 = -d^{\alpha/2}$ with probability $\tfrac{1}{2}$;

and for $i = 2, \ldots, d$,

$x_i = d^{(\alpha+1)/4}$ with probability $d^{-(\alpha+1)/2}$,
$x_i = -d^{(\alpha+1)/4}$ with probability $d^{-(\alpha+1)/2}$,
$x_i = 0$ with probability $1 - 2d^{-(\alpha+1)/2}$.

Then $X$ has mean 0 and variance-covariance matrix $\Sigma_d$ with

$\Sigma_d = d^\alpha u_1 u_1^T + \sum_{k=2}^{d} u_k u_k^T,$

where $u_1 = (1, 0, \ldots, 0)^T$.

Suppose that we only have sample size $n = 1$, i.e. $X_1 = (x_{1,1}, \ldots, x_{d,1})^T$;
then the first empirical eigenvector is

$\hat{u}_1 = (\hat{u}_{1,1}, \ldots, \hat{u}_{d,1})^T = \frac{1}{\|X_1\|} (x_{1,1}, \ldots, x_{d,1})^T.$

Under this condition, we have

$P\left(\arg\max_i |\hat{u}_{i,1}| = 1\right) = P\left(|x_{1,1}| > \max\{|x_{2,1}|, \ldots, |x_{d,1}|\}\right)$
$\qquad = P\left(x_{2,1} = 0, \ldots, x_{d,1} = 0\right)$
$\qquad = \left(1 - 2d^{-(\alpha+1)/2}\right)^{d-1}$
$\qquad \to 0$, as $d \to \infty$.

In particular, with probability tending to one, the first entry of the empirical
eigenvector is not the largest in absolute value, so thresholding cannot
reliably retain the right entries, which results in the failure of the simple
thresholding method. Similar considerations apply to other sparse PCA
methods.
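
The probability in Example 6.1 can be evaluated and checked by simulation. This is a minimal sketch with hypothetical function names; it uses the fact stated above that, for $\alpha < 1$, the first entry is the largest in absolute value exactly when $x_{2,1}, \ldots, x_{d,1}$ are all zero, each of which is non-zero with probability $2d^{-(\alpha+1)/2}$.

```python
import numpy as np

def prob_first_entry_largest(d, alpha):
    """Closed form (1 - 2 d^{-(alpha+1)/2})^{d-1} from Example 6.1."""
    q = d ** (-(alpha + 1) / 2)
    return (1.0 - 2.0 * q) ** (d - 1)

def simulate_first_entry_largest(d, alpha, reps, rng):
    """Monte Carlo estimate: fraction of draws in which x_2, ..., x_d are all zero."""
    q = d ** (-(alpha + 1) / 2)
    nonzero = rng.random((reps, d - 1)) < 2.0 * q   # True where x_i != 0
    return float(np.mean(~nonzero.any(axis=1)))

rng = np.random.default_rng(0)
exact_small = prob_first_entry_largest(10, 0.5)
exact_large = prob_first_entry_largest(10_000, 0.5)
estimate = simulate_first_entry_largest(10, 0.5, reps=20_000, rng=rng)
```

The closed form decreases quickly as $d$ grows, which illustrates why the largest entry of the empirical eigenvector is, with high probability, not the first one.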



7. Proofs.

7.1. Proofs of Theorem 2.2 and Theorem 2.3. In order to prove Theo- 
rem 2.2 and Theorem 2.3, we need the dependent extreme value results from 
Leadbetter, Lindgren and Rootzen (1983) [? ], in particular their Lemma 
6.1.1 and Theorem 6.1.3. 

An immediate consequence of those results is the following proposition. 

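Proposition 7.1. Suppose that the standard normal sequence $\{\xi_i, i = 1, \ldots, \lfloor d^\beta \rfloor\}$
satisfies the mixing condition (2.7). Let the positive constants
$\{c_i\}$ be such that $\sum_{i=1}^{\lfloor d^\beta \rfloor} (1 - \Phi(c_i))$ is bounded and such that
$C_{\lfloor d^\beta \rfloor} = \min_{1 \le i \le \lfloor d^\beta \rfloor} c_i \ge c\,(\log \lfloor d^\beta \rfloor)^{1/2}$ for some $c > 0$.
Then the following holds:

$P\left[\bigcap_{i=1}^{\lfloor d^\beta \rfloor} \{\xi_i \le c_i\}\right] - \prod_{i=1}^{\lfloor d^\beta \rfloor} \Phi(c_i) \to 0$, as $d \to \infty$,

where $\Phi$ is the standard normal distribution function. Furthermore, if for
some $\gamma \ge 0$, we have

$\sum_{i=1}^{\lfloor d^\beta \rfloor} (1 - \Phi(c_i)) \to \gamma$, as $d \to \infty$,

then

$P\left[\bigcap_{i=1}^{\lfloor d^\beta \rfloor} \{\xi_i \le c_i\}\right] \to e^{-\gamma}$, as $d \to \infty$.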



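Proposition 7.1 is used to control the right side of (2.6) through the following lemma.

Lemma 7.1. Suppose that $\xi_i \sim N(0, \delta_{i,i})$ satisfies the mixing condition (2.7),
where $\delta_{i,j}$ is the covariance of the normal sequence $\xi_i$, $i, j = 1, \ldots, \lfloor d^\beta \rfloor$.
If $C_{\lfloor d^\beta \rfloor} \ge (\log \lfloor d^\beta \rfloor)^{\delta} \max_{1 \le i \le \lfloor d^\beta \rfloor} \delta_{i,i}^{1/2}$, where $\delta \in (\tfrac{1}{2}, \infty)$, then
$C_{\lfloor d^\beta \rfloor}^{-1} \max_{1 \le i \le \lfloor d^\beta \rfloor} |\xi_i| \xrightarrow{p} 0$, as $d \to \infty$.

PROOF. Note that for every $\tau > 0$,

(7.1)    $P\left[C_{\lfloor d^\beta \rfloor}^{-1} \max_{1 \le i \le \lfloor d^\beta \rfloor} |\xi_i| > \tau\right] = P\left[\max_{1 \le i \le \lfloor d^\beta \rfloor} |\xi_i| > C_{\lfloor d^\beta \rfloor} \tau\right]$
$\qquad \le P\left[\{\max_i \xi_i > C_{\lfloor d^\beta \rfloor} \tau\} \cup \{\max_i (-\xi_i) > C_{\lfloor d^\beta \rfloor} \tau\}\right]$
$\qquad \le P\left[\max_i \xi_i > C_{\lfloor d^\beta \rfloor} \tau\right] + P\left[\max_i (-\xi_i) > C_{\lfloor d^\beta \rfloor} \tau\right]$
$\qquad = 2 P\left[\max_i \xi_i > C_{\lfloor d^\beta \rfloor} \tau\right]$
$\qquad = 2\left(1 - P\left[\bigcap_{i=1}^{\lfloor d^\beta \rfloor} \{\xi_i \le C_{\lfloor d^\beta \rfloor} \tau\}\right]\right)$
$\qquad \le 2\left(1 - P\left[\bigcap_{i=1}^{\lfloor d^\beta \rfloor} \{\xi_i \delta_{i,i}^{-1/2} \le c\,(\log \lfloor d^\beta \rfloor)^{\delta}\}\right]\right),$

where $c$ is a positive constant. Since

$\sum_{i=1}^{\lfloor d^\beta \rfloor} \left(1 - \Phi\left(c\,(\log \lfloor d^\beta \rfloor)^{\delta}\right)\right) \to 0$, as $d \to \infty$,

it then follows from Proposition 7.1 that

(7.2)    $P\left[\bigcap_{i=1}^{\lfloor d^\beta \rfloor} \{\xi_i \delta_{i,i}^{-1/2} \le c\,(\log \lfloor d^\beta \rfloor)^{\delta}\}\right] \to 1$, as $d \to \infty$.

From (7.1) and (7.2), we get

$C_{\lfloor d^\beta \rfloor}^{-1} \max_{1 \le i \le \lfloor d^\beta \rfloor} |\xi_i| \xrightarrow{p} 0$, as $d \to \infty$.

□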



Now we begin the proof of Theorem 2.2 and Theorem 2.3. Denote
$X_i = (x_{i,1}, \ldots, x_{i,n})^T$, $Z_i = (z_{i,1}, \ldots, z_{i,n})^T$, $W_i = (w_{i,1}, \ldots, w_{i,n})^T$ and
$H_i = (h_{i,1}, \ldots, h_{i,n})^T$, $i = 1, \ldots, d$.

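Note that

(7.3)    $|\langle \hat{u}_1^{ST}, u_1 \rangle| = \frac{\left|\sum_{i=1}^{\lfloor d^\beta \rfloor} \breve{u}_{i,1} u_{i,1}\right|}{\sqrt{\sum_{i=1}^{d} \breve{u}_{i,1}^2}} = \frac{\lambda_1^{-1/2} \left|\sum_{i=1}^{\lfloor d^\beta \rfloor} \breve{u}_{i,1} u_{i,1}\right|}{\sqrt{\lambda_1^{-1} \sum_{i=1}^{d} \breve{u}_{i,1}^2}}.$

Below we need to bound the denominator and the numerator of (7.3).

We start with the numerator. Since $X_i = u_{i,1} Z_1 + H_i$, $i = 1, \ldots, d$, it
follows that $\tilde{v}_1^T X_i = u_{i,1} \tilde{v}_1^T Z_1 + \tilde{v}_1^T H_i$, which yields

$\breve{u}_{i,1} = \tilde{v}_1^T X_i 1_{\{|\tilde{v}_1^T X_i| > \lambda\}} = u_{i,1} \tilde{v}_1^T Z_1 1_{\{|\tilde{v}_1^T X_i| > \lambda\}} + \tilde{v}_1^T H_i 1_{\{|\tilde{v}_1^T X_i| > \lambda\}}$

and

$\sum_{i=1}^{\lfloor d^\beta \rfloor} \breve{u}_{i,1} u_{i,1} = \sum_{i=1}^{\lfloor d^\beta \rfloor} u_{i,1}^2 \tilde{v}_1^T Z_1 1_{\{|\tilde{v}_1^T X_i| > \lambda\}} + \sum_{i=1}^{\lfloor d^\beta \rfloor} u_{i,1} \tilde{v}_1^T H_i 1_{\{|\tilde{v}_1^T X_i| > \lambda\}}$
$\qquad = \tilde{v}_1^T Z_1 - \tilde{v}_1^T Z_1 \sum_{i=1}^{\lfloor d^\beta \rfloor} u_{i,1}^2 1_{\{|\tilde{v}_1^T X_i| \le \lambda\}} + \sum_{i=1}^{\lfloor d^\beta \rfloor} u_{i,1} \tilde{v}_1^T H_i 1_{\{|\tilde{v}_1^T X_i| > \lambda\}}.$

It follows that

(7.4)    $\lambda_1^{-1/2} \left|\sum_{i=1}^{\lfloor d^\beta \rfloor} \breve{u}_{i,1} u_{i,1}\right| \le \lambda_1^{-1/2} |\tilde{v}_1^T Z_1| \sum_{i=1}^{\lfloor d^\beta \rfloor} u_{i,1}^2 + \lambda_1^{-1/2} \sum_{i=1}^{\lfloor d^\beta \rfloor} \sum_{j=1}^{n} |u_{i,1} h_{i,j}|$
$\qquad = |\tilde{v}_1^T W_1| + \sum_{i=1}^{\lfloor d^\beta \rfloor} \sum_{j=1}^{n} \lambda_1^{-1/2} |u_{i,1} h_{i,j}|$

and

(7.5)    $\lambda_1^{-1/2} \left|\sum_{i=1}^{\lfloor d^\beta \rfloor} \breve{u}_{i,1} u_{i,1}\right| \ge |\tilde{v}_1^T W_1| - |\tilde{v}_1^T W_1| \sum_{i=1}^{\lfloor d^\beta \rfloor} u_{i,1}^2 1_{\{|\tilde{v}_1^T X_i| \le \lambda\}} - \sum_{i=1}^{\lfloor d^\beta \rfloor} \sum_{j=1}^{n} \lambda_1^{-1/2} |u_{i,1} h_{i,j}|.$

Next we will show that

(7.6)    $\sum_{i=1}^{\lfloor d^\beta \rfloor} \sum_{j=1}^{n} \lambda_1^{-1/2} |u_{i,1} h_{i,j}| = o_p(d^{-\varsigma/2})$, where $\varsigma \in [0, \alpha - \eta - \theta)$.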

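Since $H_j = (h_{1,j}, \ldots, h_{d,j})^T = \sum_{k=2}^{d} z_{k,j} u_k$, $j = 1, \ldots, n$, it follows that
$h_{i,j} = \sum_{k=2}^{d} u_{i,k} z_{k,j} = \sum_{k=2}^{d} u_{i,k} \lambda_k^{1/2} w_{k,j} \sim N(0, \sigma_{i,j}^2)$, where $\sigma_{i,j}^2 \le \lambda_2$,
$i = 1, \ldots, \lfloor d^\beta \rfloor$, $j = 1, \ldots, n$. Thus, for fixed $\tau$,

$P\left[\sum_{i=1}^{\lfloor d^\beta \rfloor} \sum_{j=1}^{n} \lambda_1^{-1/2} |u_{i,1} h_{i,j}| \ge d^{-\varsigma/2} \tau\right]$
$\qquad \le P\left[\bigcup_{i=1}^{\lfloor d^\beta \rfloor} \left\{\sum_{j=1}^{n} |h_{i,j}| \ge d^{-\varsigma/2} \lambda_1^{1/2} \tau |u_{i,1}|\right\}\right]$
$\qquad \le \sum_{i=1}^{\lfloor d^\beta \rfloor} \sum_{j=1}^{n} P\left[|h_{i,j}| \ge n^{-1} d^{-\varsigma/2} \lambda_1^{1/2} \tau |u_{i,1}|\right]$
$\qquad \le 2n \lfloor d^\beta \rfloor \int_{c\, d^{(\alpha - \eta - \theta - \varsigma)/2}}^{+\infty} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right) dx$
$\qquad \to 0$, as $d \to \infty$,

where $c$ is a constant. Similarly, we can show that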
(7.7) 

and 
(7.8) 



i=i j=i 



E E A i 5 I /i mI 1 {E". 1 i^i>a} = °p(<*~ 

i=[<^]+l j=l 



where ? G [0, a — 77 — 6*). 

Finally, we want to show that

(7.9)    $\sum_{i=1}^{\lfloor d^\beta \rfloor} u_{i,1}^2 1_{\{|\tilde{v}_1^T X_i| \le \lambda\}} = o_p(d^{-\varsigma'/2}),$

where $\varsigma'$ satisfies $d^{(\varsigma' + \eta - \alpha)/2} \lambda = o(1)$. Since we can always find a subsequence
of $\{\lambda_1 / \sum_{i=2}^{d} \lambda_i\}$ and make it convergent to a nonnegative constant, for
simplicity we just assume that $\lim_{d \to \infty} \lambda_1 / \sum_{i=2}^{d} \lambda_i = C$. If $C = 0$, then the spike
index $\alpha < 1$, and Jung and Marron (2009) [? ] showed that

$c_d^{-1} S_D \Rightarrow I_n$, as $d \to \infty$,

where $c_d = n^{-1} \sum_{i=1}^{d} \lambda_i$. Since the eigenvectors $\tilde{v}_1$ of $c_d^{-1} S_D$ can be chosen
so that they are continuous according to Acker (1974) [? ], it follows that
$\tilde{v}_1 \Rightarrow v_1$, as $d \to \infty$, where $\Rightarrow$ denotes convergence in distribution and
$v_1$ is the first eigenvector of the $n$-dimensional identity matrix. If $C \neq 0$, then
the spike index $\alpha = 1$ and Jung, Sen and Marron (2010) [? ] showed that



1 II Wdl 



, as d — > oo. Therefore, we have 



(7.10) 



v[Wi \=>\ vfm | or ||Wi||,as d — >■ oo. 



Since d a A = o(l), d a Ej=i max i<i<[d' 3 ] l^ijl = °p(l)i an d 

E^M^XilO} ^ Z) dVu *.l 1 {|u i>1 t5fZ 1 |<|(?'H i |+A} 
i=l i=l 

<E rf 



1=1 



C+7J— a , -r.,-^ 

c « 2 ^j=l m ^l<i<[d£]R,jl + c " 2 A 



MWi\<\ 2 max ls .< [dj8] [tii,i| i (E™=i m aXi< i <[^ ] |fti,il+> 

/ 



< 



where c is a constant, it follows that (7.9) is established. 

Then, from (7.4), (7.5), (7.6), and (7.9), we obtain the following result 
about the numerator 



_ 3 [&] 

a x 2 1 Ui^u itl i 

1=1 



(7.11) 

Similarly for the denominator, we have 



vJW!\+O p (d a ). 



(7.12) A" 



\ i=l 



[dfi] 



\ E^M +A i 
\ i=l 



\ i=K]+i 



\ E (^i^r^i) + ^1 
^ i=i 



,^>^) 2 + V £ 1^1 

N *=i i=[d' 3 ]+i 



[<#»] n 

E A r x (EM) 2 + 

i=i i=i 



+ E E A i 'i^i^E^ii^i^}' 

i=[d?]+l 3=1 



2G 
and 
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(7.13) A" 



\ i=l 



\ 1=1 



[dfi] 



> \vlWi\ - \v1Wi\ 



\ i=l 

Combining (7.12), (7.13), (7.7), (7.8) and (7.9), we have 



[dP] n 

i=i j=i 



(7.14) 



_ i 
A 5 



««,l = l«lWi|+Op(d 



Furthermore, (7.3), (7.10), (7.11), and (7.14) suggest that 



< U° L ,Ul > 



*v[Wi\ +o p {d 



mm{t,t } 



|u£Wi| +o p (d 2 ) 



) -, , , mir W} 
- = 1 + o p (d ~ 2 



which means that u\ is consistent with u\ with convergence rate d 2 
This concludes the proof for Theorem 2.2. 



S +77 — a 

In addition, note that d 2 A = oil). If A 



a — rj — s 

o(d 2 ), then we can 



take = f. Then ■Uj !T is consistent with u\ with convergence rate d? . This 
finishes the proof of Theorem 2.3. 

7.2. Proofs of Theorem 3.1, 3.2, 3.3, 3.4 and 3.5. The proofs of Theo- 
rems 3.1, 3.2, 3.4 and 3.5 are modifications of the proofs of Theorems 2.2 
and 2.3. These are provided in the supplementary material, available online 
at [? ]. The proof of Theorem 3.3 is also given in the supplement. 

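7.3. Proof of Theorem 4.1. Since $X_j^* = (I_{\lfloor d^\beta \rfloor}, (0)_{\lfloor d^\beta \rfloor \times (d - \lfloor d^\beta \rfloor)}) X_j$, where
$I_{\lfloor d^\beta \rfloor}$ denotes the $\lfloor d^\beta \rfloor$-dimensional identity matrix and $(0)_{\lfloor d^\beta \rfloor \times (d - \lfloor d^\beta \rfloor)}$ is the
$\lfloor d^\beta \rfloor$-by-$(d - \lfloor d^\beta \rfloor)$ zero matrix, $j = 1, \ldots, n$, it follows that the $X_j^*$ have covariance matrix

$\Sigma_d^* = (I_{\lfloor d^\beta \rfloor}, (0)_{\lfloor d^\beta \rfloor \times (d - \lfloor d^\beta \rfloor)})\, \Sigma_d\, (I_{\lfloor d^\beta \rfloor}, (0)_{\lfloor d^\beta \rfloor \times (d - \lfloor d^\beta \rfloor)})^T,$

which yields

$\lambda_1^* = \lambda_1, \quad \lambda_2 \ge \lambda_i^* \ge \lambda_d, \quad i = 2, \ldots, \lfloor d^\beta \rfloor.$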
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Therefore, 



and 

A? Ai 0(d a ) 

If we rescale $\lambda_i^*$, $i = 1, \ldots, \lfloor d^\beta \rfloor$, then (7.15) satisfies the $\epsilon_2$ assumption of Jung
and Marron (2009) [? ] and (7.16) satisfies the assumption $\lambda_1 = O(d^\alpha)$ and
$\sum_{i=2}^{d} \lambda_i = O(d)$, where $\alpha < 1$. For this case, Jung and Marron (2009) [? ]
have shown that $\tilde{u}_1^*$ is strongly inconsistent with $u_1$. This means that the
oracle estimator $\hat{u}_1$ is strongly inconsistent with $u_1$.